
Feature/improve type annotations#15

Open
yanurag-dev wants to merge 24 commits into develop from feature/improve-type-annotations

Conversation

@yanurag-dev
Collaborator

@yanurag-dev yanurag-dev commented Dec 2, 2025

Summary by CodeRabbit

  • New Features

    • Project-wide static typing via a new central types module and a CI workflow to run linters, type checks, and tests.
  • Public API

    • Extensive explicit type annotations across public interfaces (behavior unchanged).
  • Tests

    • Tests reformatted for consistency; some tests include runnable example safeguards.
  • Documentation

    • README adds a Development section (type safety, running tests, code quality).
  • Chores

    • Upgraded pre-commit hooks, added mypy config and type stubs, updated lint deps and ignore rules.


- Add benchwise/types.py with TypedDict, Protocol, and Literal definitions
- Add mypy.ini with strict type checking
- Add py.typed marker for PEP 561 compliance
- Improve type annotations in exceptions, config, and datasets modules
- Update ModelAdapter and all implementations with ModelConfig TypedDict
- Add PricingInfo TypedDict for pricing dictionaries
- Add proper type annotations to all __init__ and generate methods
- Fix type issues in cost estimation calculations
- Add proper Callable type signatures for evaluate, benchmark, and stress_test decorators
- Use ParamSpec and TypeVar for generic decorator typing
- Add Awaitable type annotations for async functions
- Type all **kwargs parameters as Any
- Update _run_evaluation with specific tuple[str, ...] type
- Add type ignore comments for dynamic _benchmark_metadata attributes
- All 124 tests passing (2 pre-existing benchmark test failures unchanged)
All mypy errors resolved (82 → 0). Fixed type narrowing, cast() calls for response.json(), and missing return type annotations. Tests: 124/126 passing.
Added GitHub Actions CI workflow, pre-commit hooks, and README development section with type checking info.
@coderabbitai

coderabbitai bot commented Dec 2, 2025

Walkthrough

Adds a centralized types module and stringent mypy config, applies pervasive static typing and return annotations across the package, introduces a GitHub Actions CI workflow and pre-commit mypy/ruff integration, updates lint dependencies and .gitignore, and performs widespread non-functional formatting and test refactors.

Changes

  • New types module (benchwise/types.py): Add a comprehensive public type surface: TypeVars, ParamSpec, Literal aliases, many TypedDicts for configs/metrics/results, Protocols, CLI arg TypedDicts, and related exports.
  • Type-checking config (mypy.ini, .pre-commit-config.yaml): Add a strict mypy.ini targeting benchwise; update pre-commit hooks (pre-commit-hooks, ruff) and add a mypy hook with additional_dependencies (types-requests, pandas-stubs) and --config-file=mypy.ini.
  • CI workflow & lint deps (.github/workflows/ci.yml, pyproject.toml, .gitignore): New GitHub Actions CI (Python 3.12): checkout, setup, install editable extras, ruff lint/format checks, mypy, run tests; add lint deps to pyproject and adjust .gitignore entries.
  • Core typing rollout (benchwise/core.py, benchwise/metrics.py, benchwise/models.py, benchwise/results.py, benchwise/datasets.py): Widespread typing additions and refinements (TypedDicts, Protocols, ParamSpec/TypeVar/Awaitable), explicit return annotations, cast usage, and adjusted public signatures/return types.
  • Client & API (benchwise/client.py): Typed async context manager and lifecycle methods; _make_request_with_retry(**kwargs); typed JSON casts for API responses; OfflineQueueItem and strong return annotations for create/upload/sync/health methods.
  • CLI, config & logging (benchwise/cli.py, benchwise/config.py, benchwise/logging.py): CLI and config functions annotated with the new public types (ConfigureArgs, ConfigDict, etc.), explicit return types, masking of api_key in to_dict, and validate_api_connection wiring using asyncio.
  • Results, caching & analysis (benchwise/results.py): Standardize result/benchmark types on the new TypedDicts (EvaluationResultDict, BenchmarkResultDict); metadata/dataset_info typed, serialization/save methods annotated to return None, and analysis return types updated.
  • Dataset utilities (benchwise/datasets.py): Dataset/data/schema typed; load/create helpers accept Union inputs; added validation helpers, get_statistics, and a convert_metadata_to_info utility.
  • Models & adapters (benchwise/models.py): ModelAdapter and adapters accept ModelConfig, annotate pricing as PricingInfo, normalize generate(..., **kwargs: Any), and cast numeric pricing values.
  • Metrics implementations (benchwise/metrics.py): Metric functions return newly defined TypedDict result types, with explicit numeric casts, CI handling, and MetricCollection typing adjustments.
  • Exceptions & logging (benchwise/exceptions.py, benchwise/logging.py): Minor signature typing changes (e.g., Optional[int] for retry_after) and explicit -> None annotations.
  • Docs, tests & demo (README.md, docs/test_load_dataset.py, demo.py, tests/...): README adds a Development section (Type Safety, Running Tests, Code Quality); tests/docs/demo get formatting, quoting normalization, grouped patch contexts, assertion formatting, and a runnable main example in the docs test.
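
The strict mypy setup described above might look roughly like the fragment below. The exact flag selection and per-module sections are illustrative, not the actual mypy.ini from this PR:

```ini
[mypy]
python_version = 3.11
strict = True
show_error_codes = True
files = benchwise

; Libraries without published stubs can be ignored individually,
; while typed stubs (pandas-stubs, types-requests) cover the rest.
[mypy-rouge_score.*]
ignore_missing_imports = True
```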

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–75 minutes

  • Areas needing extra attention:
    • benchwise/types.py — verify TypedDict and Protocol shapes align with runtime payloads and downstream uses.
    • benchwise/client.py — confirm casts/Optional handling do not hide mismatched API shapes or KeyError risks.
    • benchwise/core.py — ensure evaluate/benchmark/stress_test decorator typing preserves sync/async behavior and wrapper semantics.
    • benchwise/metrics.py & benchwise/results.py — validate numeric casts and renamed/typed result fields used by consumers.
    • CI / pre-commit / mypy config — confirm mypy strict flags and pre-commit mypy extras match CI environment packages.

Poem

🐇
I hopped through types and stitched each little seam,
TypedDict carrots gleam in a neat, tidy dream.
Mypy hums a tune while ruff straightens fur,
Tests line up polite — one, two, purr.
A nibble of clarity, then off I dash — thump!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 77.78%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: The title 'Feature/improve type annotations' is directly related to the main change: a comprehensive type annotation improvement across the codebase, including the new types.py module, updated function signatures, return types, and mypy configuration.

@yanurag-dev
Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai bot commented Dec 2, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@yanurag-dev
Collaborator Author

@CodeRabbit full review

@coderabbitai

coderabbitai bot commented Dec 2, 2025

✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
benchwise/config.py (1)

451-459: Fix unused anthropic_client to satisfy Ruff

Ruff is right that anthropic_client is assigned but never used. Since you’re currently just assuming validity if the key exists (per the comment), the variable isn’t needed.

Minimal fix:

-    if os.getenv("ANTHROPIC_API_KEY"):
-        try:
-            import anthropic
-
-            anthropic_client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
-            # Note: Anthropic doesn't have a simple test endpoint
-            results["anthropic"] = True  # Assume valid if key exists
-        except Exception:
-            results["anthropic"] = False
+    if os.getenv("ANTHROPIC_API_KEY"):
+        try:
+            import anthropic  # noqa: F401
+            # Note: Anthropic doesn't have a simple test endpoint
+            results["anthropic"] = True  # Assume valid if key exists
+        except Exception:
+            results["anthropic"] = False

(or drop the import anthropic entirely if you’re comfortable treating any non-empty key as “valid”.)

🧹 Nitpick comments (8)
test_single_doc_file.py (2)

63-63: Remove redundant import.

The re module is already imported at the module level (line 29). This local import is unnecessary.

 def prepare_code_for_real_models(code: str) -> str:
     """
     Replace model names with real OpenAI and Google models.
     Ensures we use exactly 2 models: gpt-3.5-turbo and gemini-2.5-flash
     """
-    import re
-
     # Find all @evaluate decorators and replace models to ensure diversity

224-224: Add return type annotation.

Given the PR's focus on type annotations, this function should have an explicit -> int return type for consistency.

-def main():
+def main() -> int:
benchwise/metrics.py (3)

193-285: BLEU metrics: dynamic structure vs. strong result type

bleu_score now returns Dict[str, Any] while BleuScores in benchwise/types.py describes a more constrained schema. Given the richer keys here (corpus_bleu, sentence_bleu, std_sentence_bleu, bleu_n, etc.), keeping the broad Dict[str, Any] is reasonable, but the TypedDict can be misleading.

Consider either:

  • Aligning BleuScores with this actual return structure, or
  • Documenting that BleuScores is for a different, higher-level aggregation, not this function.

681-720: Perplexity: explicit float casting is good; consider similar consistency elsewhere

Casting np.mean / np.median to float for mean_perplexity and median_perplexity is good for JSON-serialization friendliness and type clarity.

You might consider doing the same in other metrics that still expose numpy scalars (bert_score_metric, semantic_similarity, factual_correctness, coherence_score, safety_score) for consistency.


899-969: Dead expression in keyword overlap helper

In _calculate_enhanced_keyword_overlap, this line is a no-op:

" ".join(ref_words)

It doesn’t affect behavior and can be safely removed for clarity.

-    important_ref_words: set[str] = set()
-    " ".join(ref_words)
+    important_ref_words: set[str] = set()
benchwise/types.py (1)

42-139: Metric result TypedDicts vs. actual metric outputs

The metric result TypedDicts (RougeScores, BleuScores, BertScoreResults, AccuracyResults, SemanticSimilarityResults, etc.) don’t currently mirror the exact shapes returned by the implementations in benchwise/metrics.py (e.g., bleu_score uses corpus_bleu/sentence_bleu and bleu_n keys, semantic_similarity returns mean_similarity, median_similarity, etc.).

If these types are meant to describe:

  • The raw function outputs → consider aligning the field names and optionality.
  • Higher-level, normalized results → consider clarifying that in docstrings and possibly renaming to avoid confusion.
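
One lightweight way to keep a TypedDict honest about a looser runtime shape is to mark volatile keys optional via total=False and read them defensively. A sketch with hypothetical field names (not the actual BleuScores definition):

```python
from typing import TypedDict

# Illustrative alignment only; the real field set may differ.
class BleuScores(TypedDict, total=False):
    corpus_bleu: float
    sentence_bleu: float
    std_sentence_bleu: float

def corpus_or_default(scores: BleuScores) -> float:
    # Under total=False every key may be absent, so use .get().
    return scores.get("corpus_bleu", 0.0)
```

The trade-off: total=False documents the possible keys without promising they are present, which suits a function whose output keys vary by configuration.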
benchwise/datasets.py (1)

294-297: Unreachable code path.

This raise ValueError is unreachable because:

  • isinstance(source, dict) → returns at line 221
  • isinstance(source, (str, Path)) → either returns or raises in all branches
  • No other type satisfies the Union[str, Path, Dict[str, Any]] signature

The type checker may flag this as dead code. Consider removing it or adding a comment explaining it's a defensive fallback for type safety.

-    raise ValueError(
-        f"Unable to load dataset from source: {source}"
-    )
+    # This line should be unreachable given the type signature,
+    # but serves as a defensive fallback for runtime type errors
+    raise ValueError(f"Unable to load dataset from source: {source}")
benchwise/models.py (1)

90-91: Redundant casts on PricingInfo values.

Since PricingInfo already defines input and output as float, the float() casts are unnecessary. However, they're harmless and provide runtime safety if the dict comes from an untrusted source.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a2569de and 87bb74c.

📒 Files selected for processing (18)
  • .github/workflows/ci.yml (1 hunks)
  • .pre-commit-config.yaml (1 hunks)
  • GEMINI.md (1 hunks)
  • README.md (5 hunks)
  • benchwise/cli.py (9 hunks)
  • benchwise/client.py (18 hunks)
  • benchwise/config.py (14 hunks)
  • benchwise/core.py (10 hunks)
  • benchwise/datasets.py (14 hunks)
  • benchwise/exceptions.py (2 hunks)
  • benchwise/logging.py (1 hunks)
  • benchwise/metrics.py (24 hunks)
  • benchwise/models.py (11 hunks)
  • benchwise/results.py (12 hunks)
  • benchwise/types.py (1 hunks)
  • mypy.ini (1 hunks)
  • pyproject.toml (1 hunks)
  • test_single_doc_file.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (7)
benchwise/config.py (1)
benchwise/types.py (1)
  • ConfigDict (180-194)
benchwise/models.py (2)
benchwise/types.py (2)
  • ModelConfig (24-33)
  • PricingInfo (36-39)
tests/conftest.py (1)
  • generate (180-186)
benchwise/datasets.py (1)
benchwise/types.py (2)
  • DatasetItem (142-157)
  • DatasetMetadata (160-168)
benchwise/cli.py (3)
benchwise/metrics.py (1)
  • accuracy (439-550)
benchwise/datasets.py (1)
  • references (67-81)
run_tests.py (1)
  • main (110-250)
benchwise/client.py (3)
tests/test_rate_limits.py (1)
  • json (18-19)
benchwise/core.py (1)
  • benchmark (136-154)
benchwise/datasets.py (1)
  • get (420-421)
benchwise/types.py (3)
benchwise/metrics.py (1)
  • accuracy (439-550)
benchwise/datasets.py (3)
  • size (46-48)
  • prompts (51-64)
  • references (67-81)
benchwise/results.py (1)
  • success (37-39)
benchwise/core.py (4)
benchwise/metrics.py (1)
  • evaluate (1494-1510)
benchwise/datasets.py (1)
  • Dataset (13-200)
benchwise/results.py (1)
  • EvaluationResult (12-76)
benchwise/cli.py (1)
  • run_evaluation (114-285)
🪛 GitHub Actions: CI
benchwise/config.py

[error] 455-455: ruff: Local variable 'anthropic_client' assigned to but never used

benchwise/datasets.py

[error] 9-9: ruff: 'DatasetItem' imported but unused


[error] 9-9: ruff: 'DatasetMetadata' imported but unused

benchwise/cli.py

[error] 8-8: ruff: 'typing.Dict' imported but unused

benchwise/metrics.py

[error] 1-1: ruff: 'Union' imported but unused


[error] 3-3: ruff: 'NDArray' imported but unused


[error] 4-4: ruff: 'BleuScores' imported but unused


[error] 4-4: ruff: 'BertScoreResults' imported but unused


[error] 4-4: ruff: 'AccuracyResults' imported but unused

🔇 Additional comments (54)
GEMINI.md (1)

1-1: LGTM!

Simple cross-reference to consolidate AI assistant instructions in a single location (CLAUDE.md).

README.md (1)

181-227: Good documentation of the new type safety requirements.

The Development section clearly documents the type checking setup, test commands, and code quality tooling. This aligns well with the PR's goal of introducing strict type checking across the codebase.

benchwise/logging.py (1)

71-71: LGTM!

Adding the explicit -> None return type annotation improves type safety and aligns with the PR's objective of comprehensive type annotations.

benchwise/exceptions.py (1)

7-8: LGTM!

The addition of Optional import and explicit type annotations on RateLimitError.__init__ improves type safety without changing runtime behavior.

Also applies to: 22-25

benchwise/metrics.py (5)

64-162: ROUGE-L typing and casting look consistent

Using RougeScores as the return type and explicitly casting np.mean/np.std outputs to float avoids numpy-scalar surprises and matches the TypedDict shape (including the nested scores: Dict[str, List[float]]). No issues from a typing or behavior standpoint here.


287-303: Smoothing function typing is fine; ignores are appropriate

Returning Optional[Callable[..., Any]] and using # type: ignore[no-any-return] on bound methods is a pragmatic way to appease mypy given the NLTK stubs. The behavior remains unchanged and matches sentence_bleu’s expectations.


1012-1088: Coherence score casting and component typing look solid

Typing component_scores as Dict[str, List[float]] and casting overall_coherence to float keeps the result structure predictable and avoids numpy scalars leaking into top-level fields. This plays nicely with mypy and downstream JSON consumers.


1260-1441: Safety score: good use of optional parameters and typed structures

The added type annotations for safety_categories, perspective_api_key, category_scores, and violation_details are consistent with how the function is used. Logic remains unchanged and the result dict structure is clear.


1484-1511: MetricCollection typing matches usage

Using Dict[str, Tuple[Callable[..., Any], Dict[str, Any]]] for self.metrics and an optional references parameter in evaluate lines up with how you register and invoke the metric functions. This should make mypy happier without altering behavior.

benchwise/types.py (2)

141-217: Dataset and config TypedDicts look appropriate

DatasetItem, DatasetMetadata, DatasetSchema, and ConfigDict line up with how datasets and config are described in the rest of the codebase (e.g., benchwise/datasets.py, BenchwiseConfig.to_dict). Using total=False is sensible given the optional nature of many fields.


228-291: Protocols provide useful structural contracts

The SupportsGenerate, SupportsCache, and SupportsMetrics Protocols capture the expected interfaces cleanly and should integrate well with adapters and caching layers. Async generate and sync evaluate/save/load signatures look reasonable.
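
Because Protocols are structural, an adapter satisfies them simply by matching the method shape, with no inheritance required. A minimal sketch (the generate signature here is an assumption based on the summary above, not the exact benchwise definition):

```python
import asyncio
from typing import Any, Protocol

class SupportsGenerate(Protocol):
    async def generate(self, prompt: str, **kwargs: Any) -> list[str]: ...

class EchoAdapter:
    # Satisfies SupportsGenerate structurally; no base class needed.
    async def generate(self, prompt: str, **kwargs: Any) -> list[str]:
        return [prompt]

async def run_model(model: SupportsGenerate, prompt: str) -> list[str]:
    return await model.generate(prompt)

print(asyncio.run(run_model(EchoAdapter(), "hello")))  # → ['hello']
```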

pyproject.toml (1)

69-75: Typing stub additions align with mypy usage

Adding pandas-stubs and types-requests to the lint extra matches the new mypy/pre-commit configuration and avoids ignore_missing_imports for those libraries. No issues here.

.pre-commit-config.yaml (1)

20-28: Mypy pre-commit hook is wired correctly

Using mirrors-mypy with additional_dependencies matching your stub packages and args: [--config-file=mypy.ini] looks good. Restricting to files: ^benchwise/ keeps checks focused on library code.

.github/workflows/ci.yml (1)

1-36: CI workflow matches tooling stack

The CI job (ruff lint/format, mypy with mypy.ini, and basic tests via run_tests.py) is consistent with the new typing and linting setup. Using Python 3.12 while configuring mypy for 3.11 is fine given 3.11 is your minimum supported version.

mypy.ini (2)

1-24: Strict mypy configuration is coherent

Enabling strict plus the additional flags, and limiting checks to benchwise is a solid baseline. Showing error codes will help with targeted suppressions.


25-63: Third-party ignore sections are sensible

Ignoring or skipping imports for libraries without stubs (rouge_score, bert_score, nltk, transformers, etc.) while relying on typed stubs for pandas/requests matches the dependencies declared in pyproject.toml. This should keep noise down while still catching real type issues.

benchwise/cli.py (2)

205-235: Using Any for metric_result is pragmatic

Annotating metric_result as Any when indexing into the dict returned by accuracy avoids over-constraining the type and matches the current implementation of benchwise.metrics.accuracy. No behavior change, and it keeps mypy satisfied.


288-513: CLI helper annotations improve clarity

Adding explicit -> None / -> BenchmarkResult / -> None return types to configure_api, sync_offline, show_status, list_resources, validate_dataset, compare_results, and main makes the CLI surface clearer without altering behavior. The async/sync boundaries and asyncio.run usage remain consistent.

benchwise/config.py (5)

56-65: Config list fields and post_init typing look good

Typing default_models and default_metrics as List[str] with field(default_factory=list) plus a __post_init__ that delegates to the existing loaders/validators is a clean improvement and lines up with ConfigDict in benchwise.types.


174-233: ConfigDict integration is consistent

to_dict returning ConfigDict (with a masked api_key) matches the TypedDict definition and keeps the shape of configuration data explicit for any consumers (printing, status commands, etc.).


274-317: configure_benchwise kwargs typing is appropriate

Typing **kwargs: Any and re-validating via _validate_config() after applying overrides is a good balance between flexibility and safety. This should work well under the new strict mypy settings.


414-419: validate_api_connection: clear boolean typing

Explicitly annotating the inner check_connection() -> bool and wrapping the status code comparison in bool(...) clarifies intent and aligns with the function’s declared -> bool return type. Behavior is unchanged.


487-516: print_configuration_status nicely composes existing helpers

Using config.print_config(), validate_api_connection, validate_api_keys, and get_cache_dir in print_configuration_status gives a good, centralized health view. The function’s -> None annotation and printed statuses match its intended use as a diagnostic helper.

benchwise/datasets.py (5)

29-38: LGTM!

The -> None return type annotation for __post_init__ is correct and aligns with dataclass conventions.


83-91: LGTM!

The Callable[[Dict[str, Any]], bool] type hint accurately describes the filter condition, and the defensive self.metadata or {} pattern prevents potential None spread errors.


176-199: LGTM!

The explicit type annotations and type narrowing pattern (filtering with isinstance before numeric operations) satisfy strict type checkers while maintaining runtime safety.


299-408: LGTM!

The **kwargs: Any type annotations on create_qa_dataset, create_summarization_dataset, and create_classification_dataset are appropriate for their flexible metadata usage.


411-431: LGTM!

The -> None return type annotations on __init__, register, remove, and clear methods correctly indicate they perform mutations without returning values.

benchwise/client.py (7)

5-5: LGTM!

Adding cast from typing enables explicit type assertions for JSON response handling throughout the client.


22-26: LGTM!

The Optional[int] and Optional[str] type annotations correctly reflect that these exception parameters may be absent.


59-78: LGTM!

The type annotations for offline_queue, async context manager methods, and close() are correct. Using Any for __aexit__ exception parameters is a common and acceptable pattern.


325-360: LGTM!

The cast() calls provide type safety for JSON responses. These assume a stable API contract; if the API response structure changes, runtime errors would occur at the cast sites, making debugging easier.
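
The cast() pattern referenced here is a type-checker-only assertion: at runtime it returns its argument unchanged, so a mismatched shape surfaces later at the access site rather than at the cast. A sketch (the payload shape is hypothetical):

```python
from typing import Any, Dict, cast

def handle_response(raw_json: Any) -> str:
    # cast() does nothing at runtime; it only tells mypy what shape
    # we expect, e.g. the dict produced by response.json().
    data = cast(Dict[str, Any], raw_json)
    return str(data["status"])

print(handle_response({"id": "abc123", "status": "ok"}))  # → ok
```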


467-504: LGTM!

The Optional[Dict[str, Any]] type for metadata and consistent use of cast() for response handling align with the file's type safety approach.


744-751: LGTM!

The -> None return type annotation correctly indicates this method performs a side effect (appending to queue) without returning a value.


824-828: LGTM!

The nested cast() calls for extracting file_info["url"] from the response provide type safety for the chained dictionary access.

benchwise/models.py (5)

4-5: LGTM!

Importing ModelConfig and PricingInfo from the new types module enables consistent typing across all model adapters.


10-17: LGTM!

The type annotations are correct. Since ModelConfig is a TypedDict with total=False, an empty dict {} is a valid instance, making config or {} type-safe.


47-52: LGTM!

The Dict[str, PricingInfo] annotation correctly describes the pricing structure, and all model entries conform to the PricingInfo TypedDict with required input and output float keys.


95-273: LGTM!

All model adapters (AnthropicAdapter, GoogleAdapter, HuggingFaceAdapter, MockAdapter) consistently implement the type annotations with Optional[ModelConfig] for config and List[str] for generate return types.


275-298: LGTM!

The factory function correctly accepts Optional[ModelConfig] and returns the ModelAdapter base type, allowing for polymorphic usage of different adapter implementations.

benchwise/results.py (6)

60-76: LGTM!

The Optional[str] type for metric_name correctly indicates that None means "return the main result," and the Union[float, Any] return type accommodates both numeric scores and arbitrary result types.


96-154: LGTM!

The -> None for add_result and Optional[EvaluationResult] returns for get_best_model/get_worst_model correctly handle empty results scenarios.


156-196: LGTM!

The compare_models method correctly uses Optional[str] for metric_name and returns a comprehensive comparison dictionary with consistent typing.


244-253: LGTM!

The -> None return type annotations on save_to_json and save_to_csv correctly indicate these are side-effect-only methods.


258-299: LGTM!

The explicit Dict[str, Any] type annotation on comparison and the conversion of set() to list() before return ensures type consistency.


547-549: LGTM!

The -> None return type correctly indicates this function performs file I/O without returning a value.

benchwise/core.py (8)

1-1: LGTM!

Adding ParamSpec, TypeVar, Awaitable, and Union enables sophisticated generic typing for decorators while maintaining runtime flexibility with Any.


13-16: LGTM!

ParamSpec and TypeVar definitions follow the standard pattern for preserving callable signatures in decorator type annotations.


20-60: LGTM!

The complex return type annotation accurately describes the decorator transformation: a test function becomes a wrapper that accepts a Dataset and returns List[EvaluationResult]. The # type: ignore[attr-defined] is appropriate for the dynamic _benchmark_metadata attribute.


136-154: LGTM!

The benchmark decorator correctly uses Callable[[Callable[P, R]], Callable[P, R]] to indicate it preserves the original function's signature. The # type: ignore[attr-defined] is appropriate for the dynamic metadata attribute.


157-194: LGTM!

The stress_test decorator correctly uses ParamSpec for signature preservation and Union[R, BaseException] to reflect asyncio.gather(return_exceptions=True) semantics.
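
The Union[R, BaseException] element type follows directly from how asyncio.gather behaves with return_exceptions=True; a small self-contained illustration:

```python
import asyncio

async def task(i: int) -> int:
    if i == 1:
        raise ValueError("boom")
    return i * 10

async def main() -> list:
    # Each slot is either a result (R) or the raised exception
    # (BaseException), matching Union[R, BaseException].
    return await asyncio.gather(*(task(i) for i in range(3)), return_exceptions=True)

results = asyncio.run(main())
print(results)  # → [0, ValueError('boom'), 20]
```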


200-272: LGTM!

The EvaluationRunner class has comprehensive type annotations: explicit field types in __init__, Callable[..., Awaitable[Any]] for flexible test functions, and Optional[str] for optional metric selection.


275-298: LGTM!

The run_benchmark and quick_eval functions have precise type annotations: Callable[..., Awaitable[Any]] for flexible async functions and Callable[[str], float] for the metric function that scores individual responses.


63-70: > Likely an incorrect or invalid review comment.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
benchwise/datasets.py (1)

213-305: load_dataset URL handling is broken for HTTP(S) JSON sources

The code calls source_path = Path(source) before checking for HTTP URLs (line 234), causing strings like "https://example.com/data.json" to be treated as local paths with .json suffix. The condition if source_path.suffix == ".json" (line 237) then executes first, attempting to open the URL as a local file with open(source_path, "r") (line 238), resulting in FileNotFoundError. The later elif str(source).startswith(("http://", "https://")) branch (line 271) is unreachable for any .json URL.

Detect HTTP(S) URLs before treating the source as a filesystem path:

-    elif isinstance(source, (str, Path)):
-        source_path = Path(source)
-
-        if source_path.suffix == ".json":
+    elif isinstance(source, (str, Path)):
+        # Handle HTTP(S) URLs first
+        if isinstance(source, str) and source.startswith(("http://", "https://")):
+            source_str = source
+            response = requests.get(source_str)
+            response.raise_for_status()
+
+            if source_str.endswith(".json"):
+                data = response.json()
+                if isinstance(data, dict) and "data" in data:
+                    return Dataset(
+                        name=data.get("name", "remote_dataset"),
+                        data=data["data"],
+                        metadata=data.get("metadata", {}),
+                        schema=data.get("schema"),
+                    )
+                elif isinstance(data, list):
+                    return Dataset(
+                        name=kwargs.get("name", "remote_dataset"),
+                        data=data,
+                        metadata=kwargs.get("metadata", {}),
+                    )
+                else:
+                    raise ValueError(
+                        f"Invalid JSON format from '{source_str}'. Expected a list or a dict with 'data' key."
+                    )
+            else:
+                raise ValueError(
+                    f"Unsupported URL format '{source_str}'. Only .json URLs are supported."
+                )
+
+        else:
+            source_path = Path(source)
+
+            if source_path.suffix == ".json":
             ...
-        elif source_path.suffix == ".csv":
+            elif source_path.suffix == ".csv":
             ...
-        elif str(source).startswith(("http://", "https://")):
-            # Convert to str for requests.get
-            source_str = str(source)
-            response = requests.get(source_str)
-            ...
-        else:
-            raise ValueError(
-                f"Unsupported file format '{source_path.suffix}'. Supported formats: .json, .csv"
-            )
+            else:
+                raise ValueError(
+                    f"Unsupported file format '{source_path.suffix}'. Supported formats: .json, .csv"
+                )
🧹 Nitpick comments (9)
benchwise/cli.py (2)

8-8: Tighten typing for metric_result/results instead of using Any

accuracy (and the other metric functions) return dictionaries, so declaring metric_result: Any = ... loses most of the benefit of the new type hints, and results is currently untyped as well.

If you want stricter static checking here, consider something along these lines:

-from typing import List, Optional, Any
+from typing import Any, Dict, List, Optional
...
-            results = {}
+            results: Dict[str, Any] = {}
             for metric_name in metrics:
-                try:
-                    if metric_name == "accuracy":
-                        metric_result: Any = accuracy(responses, references)
+                try:
+                    metric_result: Dict[str, Any]
+                    if metric_name == "accuracy":
+                        metric_result = accuracy(responses, references)
                         results["accuracy"] = metric_result["accuracy"]
                     elif metric_name == "rouge_l":
                         metric_result = rouge_l(responses, references)
                         results["rouge_l_f1"] = metric_result["f1"]
                     elif metric_name == "semantic_similarity":
                         metric_result = semantic_similarity(responses, references)

This keeps runtime behavior identical while giving mypy/pyright a clearer view of the data flowing through this block.

Also applies to: 205-213


288-355: CLI helper type hints look consistent; consider narrowing args: Any

The new return annotations and parameter types for list_resources, validate_dataset, compare_results, and main look correct and match how these functions are used from the CLI.

For the async helpers that take args:

async def configure_api(args: Any) -> None:
async def sync_offline(args: Any) -> None:
async def show_status(args: Any) -> None:

args is always produced by parser.parse_args(), so its runtime type is argparse.Namespace. If you want slightly stronger typing without major refactors, you could narrow these to:

-import argparse
+import argparse
...
-async def configure_api(args: Any) -> None:
+async def configure_api(args: argparse.Namespace) -> None:
...
-async def sync_offline(args: Any) -> None:
+async def sync_offline(args: argparse.Namespace) -> None:
...
-async def show_status(args: Any) -> None:
+async def show_status(args: argparse.Namespace) -> None:

This is optional but makes the intent clearer and avoids Any in a few more places while keeping the current CLI behavior unchanged.

Also applies to: 357-413, 415-479, 481-483, 514-571

benchwise/metrics.py (4)

25-44: Float casting in _bootstrap_confidence_interval stabilizes the return type

Wrapping np.mean/np.percentile results in float(...) ensures the helper always returns plain Python floats, which is friendlier for typing and JSON serialization. Behavior for empty scores (returning (nan, nan) when scores == []) is unchanged but a bit surprising; consider explicitly guarding against the empty-list case if you ever expect it.


69-69: rouge_l result now matches the structured RougeScores shape

The RougeScores return annotation, explicit scores: Dict[str, List[float]], and casting of aggregates/std-devs to float make the output shape and types much clearer and consistent with the typed dict definition. One minor completeness thought: benchwise/types.py’s docstring mentions a rougeL_f1 field; if that key is meant to be part of RougeScores, you may want to either (a) add it to scores/result, or (b) adjust the type/docstring so f1 is clearly the custom ROUGE-L F-score you already compute.

Also applies to: 100-106, 140-150


170-177: BLEU helpers’ typing is clearer; consider minor polish only

Typing bleu_score as Dict[str, Any] and explicitly annotating ngram_precisions: Dict[str, List[float]] clarify the somewhat dynamic metric dict. _get_smoothing_function’s Optional[Callable[..., Any]] and _get_weights’s Tuple[float, ...] are appropriate, and the type: ignore comments around NLTK’s bound methods are a pragmatic way to satisfy static checkers. No correctness issues spotted.

Also applies to: 214-216, 294-307, 310-314


797-797: Factual correctness helpers’ typing is mostly solid; remove stray no-op expression

Annotating _analyze_factual_correctness and _calculate_entity_overlap with concrete str/float types and nlp_model: Any keeps spaCy optional without over-constraining the interface, and the enhanced keyword overlap typing (important_ref_words: set[str]) is clear.

There is one small cleanup opportunity: line 929 (" ".join(ref_words)) is a standalone expression whose result is ignored and can be dropped without changing behavior.

Also applies to: 844-848, 883-883, 909-979

benchwise/datasets.py (1)

81-90: Filter/sample/split metadata handling is reasonable, but size remains original

The new Callable[[Dict[str, Any]], bool] type on filter, and use of metadata = self.metadata or {} with metadata={**metadata, ...} for filter, sample, and split are all sound and avoid mutating the original metadata dict.

Note that the size in propagated metadata continues to reflect the original dataset rather than the filtered/sample/split sizes; that’s pre-existing behavior but might surprise consumers expecting size == len(data) on derived datasets. Consider updating or adding a separate field (e.g., effective_size) if you ever revisit metadata semantics.

Also applies to: 91-105, 122-140
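An `effective_size` approach could be sketched like this; `derive_metadata` is a hypothetical helper name, not part of the current API:

```python
from typing import Any, Dict, List, Optional


def derive_metadata(
    metadata: Optional[Dict[str, Any]],
    data: List[Dict[str, Any]],
    operation: str,
) -> Dict[str, Any]:
    """Build metadata for a derived dataset without mutating the original."""
    base = metadata or {}
    # Keep the original `size` intact; record the derived length separately.
    return {**base, "derived_from": operation, "effective_size": len(data)}
```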

benchwise/config.py (2)

400-420: validate_api_connection async usage can misreport failures inside running event loops

Wrapping the HTTP health check in an async def check_connection() and calling asyncio.run(check_connection()) works well in plain scripts/CLIs, but asyncio.run raises RuntimeError when called from an already running event loop (e.g., some notebook or async framework contexts). In those cases you’ll fall into the broad except and return False even if the API is reachable.

If you want this helper to be robust in such environments, consider either:

  • Switching to a synchronous httpx.Client call here, or
  • Detecting a running loop and using it instead of asyncio.run.

As it stands this is acceptable for CLI-style diagnostics but may give false negatives in async-heavy applications.
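The running-loop-detection option could be sketched as follows. `run_coro` is a hypothetical helper, assuming it is acceptable to block the calling coroutine while a worker thread drives a fresh event loop:

```python
import asyncio
import concurrent.futures
from typing import Any, Coroutine, TypeVar

T = TypeVar("T")


def run_coro(coro: Coroutine[Any, Any, T]) -> T:
    """Run a coroutine whether or not an event loop is already running.

    asyncio.run() raises RuntimeError inside a running loop, so in that
    case the coroutine is handed to a fresh loop on a worker thread.
    """
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop running: the simple path works.
        return asyncio.run(coro)
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


async def ping() -> str:
    return "ok"


async def nested() -> str:
    # Called from inside a running loop: exercises the thread-pool branch.
    return run_coro(ping())
```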


427-486: API key validation and configuration status are useful diagnostics; be mindful of network cost

validate_api_keys now uses provider-specific clients/endpoints (OpenAI client, Anthropic client construction, Google Generative AI list_models, and Hugging Face’s /api/whoami) to infer key validity, and print_configuration_status wires that together with validate_api_connection and cache-dir checks. This is a good fit for an on-demand diagnostic command.

Given these helpers perform live network calls and, in some cases, potentially heavy operations (e.g., listing all models), it’s worth keeping them out of hot paths and reserving them for explicit “status”/debug flows, which you are already doing.

Also applies to: 488-517

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 87bb74c and e6254c3.

📒 Files selected for processing (7)
  • .gitignore (1 hunks)
  • benchwise/cli.py (9 hunks)
  • benchwise/client.py (24 hunks)
  • benchwise/config.py (14 hunks)
  • benchwise/datasets.py (14 hunks)
  • benchwise/metrics.py (24 hunks)
  • mypy.ini (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • mypy.ini
  • benchwise/client.py
🧰 Additional context used
🧬 Code graph analysis (4)
benchwise/metrics.py (3)
benchwise/types.py (1)
  • RougeScores (43-58)
benchwise/core.py (1)
  • evaluate (20-60)
benchwise/datasets.py (1)
  • references (65-79)
benchwise/datasets.py (1)
tests/test_integration_real_api.py (1)
  • test_dataset (23-28)
benchwise/config.py (1)
benchwise/types.py (1)
  • ConfigDict (180-194)
benchwise/cli.py (2)
benchwise/metrics.py (1)
  • accuracy (446-557)
benchwise/datasets.py (1)
  • references (65-79)
🔇 Additional comments (12)
benchwise/metrics.py (4)

1-3: Import cleanup and RougeScores wiring look good

Removing unused typing/NDArray imports and importing only RougeScores from benchwise.types aligns with prior Ruff feedback and keeps the module’s surface tidy. No issues here.


317-324: Return-type widening for metric functions is consistent with their dynamic outputs

Using Dict[str, Any] for bert_score_metric, accuracy, semantic_similarity, and perplexity, plus casting perplexity aggregates to float, accurately reflects that these functions return heterogeneous metric dictionaries while still giving callers a predictable top-level shape. This aligns with how the rest of the metrics module structures its outputs.

Also applies to: 446-455, 560-568, 688-689, 724-725


1042-1042: Coherence-related helpers: numeric typing and casts are coherent

Explicitly typing component_scores: Dict[str, List[float]], casting overall_coherence and the length coefficient-of-variation to float, and returning float(np.mean(overlaps)) from _calculate_topic_consistency all help static analysis and avoid leaking numpy scalar types. The semantics stay the same and the changes are safe.

Also applies to: 1061-1061, 1119-1125, 1268-1268


1272-1279: Safety metrics and MetricCollection typing align with usage

Annotating safety_categories/perspective_api_key as Optional[...], making category_scores: Dict[str, List[float]] and violation_details: List[List[str]], and typing MetricCollection.metrics as Dict[str, Tuple[Callable[..., Any], Dict[str, Any]]] give a realistic type view of how metrics and their kwargs are stored and invoked. The evaluate method’s references: Optional[List[str]] matches how you mix metrics that do and don’t need references. No behavioral regressions spotted.

Also applies to: 1357-1358, 1498-1503, 1508-1519

benchwise/datasets.py (5)

1-1: Typing imports and __post_init__ annotation are appropriate

Adding Callable/Union to the imports and annotating __post_init__ as returning None match how the class is used and keep the dataclass lifecycle explicit. No issues here.

Also applies to: 27-27


65-79: Including label in references improves classification support

Extending the reference extraction to fall back to item.get("label") makes the Dataset.references property usable for simple classification datasets without extra wiring. Converting to str as before keeps the return type consistent.


163-167: to_csv and get_statistics typing/numeric handling are solid

Annotating to_csv as returning None matches its side-effect-only behavior. In get_statistics, explicitly typing fields: List[str] and stats: Dict[str, Any], then narrowing to numeric_values under the all(isinstance(..., (int, float))) guard, makes the numeric summary logic clearer and type-checker-friendly without changing semantics.

Also applies to: 182-210


307-343: Dataset factory functions: type annotations align with usage

Adding **kwargs: Any and explicit List[str] annotations for questions/answers, documents/summaries, and texts/labels matches the way these helpers are used (including in tests/test_integration_real_api.py) and makes their contracts clearer. The schemas and metadata blocks are unchanged and remain consistent.

Also applies to: 345-382, 385-418


421-441: DatasetRegistry typing is straightforward and correct

Typing the registry’s internal map as Dict[str, Dataset] and annotating __init__, register, remove, and clear as returning None documents the side-effect-only API and avoids ambiguity for callers. No behavior changes introduced.

benchwise/config.py (3)

10-10: Config typing imports and default model/metric lists are well integrated

Importing List and ConfigDict, then declaring default_models/default_metrics as List[str] = field(default_factory=list), cleanly extends the configuration surface and avoids the shared-mutable-default pitfall. The ConfigDict alias ties the in-memory config and its serialized form together nicely.

Also applies to: 14-14, 57-59


60-67: Internal config lifecycle methods are now explicitly typed and safer to consume

Adding -> None to __post_init__, _load_from_env, _load_from_file, _validate_config, and save_to_file makes their side-effect-only nature explicit. Persisting default_models/default_metrics in save_to_file, and returning a ConfigDict from to_dict that masks api_key as "***" when present, are good choices for both type safety and secret handling. print_config builds on that masked dict, which avoids leaking real keys.

Also applies to: 118-145, 174-199, 217-233, 235-243


263-272: Global config helpers: signatures and kwargs handling look correct

Annotating set_api_config and reset_config as returning None, and configure_benchwise(..., **kwargs: Any) -> BenchwiseConfig, makes the public API clearer. The kwargs loop still only sets attributes that exist on the config object, which keeps this extension-safe while preventing accidental misspellings from silently adding new attributes. Re-validating via config._validate_config() after changes is also the right call.

Also applies to: 274-281, 296-317, 320-324


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
tests/test_models.py (1)

83-86: Verify Python 3.10+ requirement for parenthesized context managers.

The parenthesized context manager syntax at these locations requires Python 3.10+. This is the same syntax change applied in tests/test_integration.py (lines 112-115).

Also applies to: 201-204, 209-212
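For reference, this is the syntax in question, valid from Python 3.10 onward:

```python
from contextlib import nullcontext

# Parenthesized context managers: multiple managers grouped in parentheses
# across lines, no backslash continuations (a SyntaxError before 3.10).
with (
    nullcontext("a") as first,
    nullcontext("b") as second,
):
    pair = (first, second)
```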

🧹 Nitpick comments (5)
benchwise/types.py (3)

21-26: Potential duplicate type variable definition.

P and R are also defined in benchwise/core.py (lines 24-25). If both modules are imported together, this could cause confusion or shadowing issues. Consider importing these from types.py in core.py instead of redefining them.


107-116: AccuracyResults TypedDict is incomplete.

The accuracy() function in metrics.py returns additional fields not defined here: exact_accuracy, fuzzy_accuracy, correct_fuzzy, mean_score, std_score, individual_scores, and match_types. Consider adding these for complete type coverage.

 class AccuracyResults(TypedDict, total=False):
     """Return type for accuracy metric."""

     accuracy: float
+    exact_accuracy: float
+    fuzzy_accuracy: float
     correct: int
+    correct_fuzzy: int
     total: int
-    std_accuracy: float
-    scores: List[float]
+    mean_score: float
+    std_score: float
+    individual_scores: List[float]
+    match_types: List[str]
     # Optional confidence interval
     accuracy_confidence_interval: Tuple[float, float]

163-179: DatasetItem is missing fields used by Dataset class.

The Dataset.prompts property (in datasets.py) also checks for document field, and Dataset.references checks for label field. Add these for completeness.

 class DatasetItem(TypedDict, total=False):
     """A single item in a dataset."""

     # Common field names
     prompt: str
     input: str
     question: str
     text: str
+    document: str  # For summarization datasets
     # Reference/target fields
     reference: str
     output: str
     answer: str
     target: str
     summary: str
+    label: str
     # Additional fields
     id: str
     metadata: Dict[str, Any]
benchwise/core.py (2)

23-25: Import type variables from benchwise.types instead of redefining.

P and R are also defined in benchwise/types.py. For consistency with the PR's goal of centralizing type definitions, import them from there.

+from .types import P, R
+
-# Type variables for decorator typing
-P = ParamSpec("P")
-R = TypeVar("R")

30-35: Return type annotation may not precisely match wrapper signature.

The wrapper function signature is (dataset: Dataset, **test_kwargs: Any), but the return type annotation specifies Callable[[Dataset, Any], Awaitable[...]]. Consider using Callable[..., Awaitable[List[EvaluationResult]]] for the wrapper type, or documenting that Any represents **kwargs.

 def evaluate(
     *models: str, upload: Optional[bool] = None, **kwargs: Any
 ) -> Callable[
     [Callable[..., Awaitable[Any]]],
-    Callable[[Dataset, Any], Awaitable[List[EvaluationResult]]],
+    Callable[..., Awaitable[List[EvaluationResult]]],
 ]:
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e6254c3 and 88c052d.

📒 Files selected for processing (9)
  • benchwise/core.py (12 hunks)
  • benchwise/exceptions.py (1 hunks)
  • benchwise/logging.py (1 hunks)
  • benchwise/results.py (12 hunks)
  • benchwise/types.py (1 hunks)
  • tests/test_config.py (1 hunks)
  • tests/test_docs_examples.py (7 hunks)
  • tests/test_integration.py (1 hunks)
  • tests/test_models.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tests/test_config.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • benchwise/exceptions.py
  • benchwise/results.py
🧰 Additional context used
🧬 Code graph analysis (3)
tests/test_models.py (1)
benchwise/models.py (1)
  • HuggingFaceAdapter (208-250)
benchwise/types.py (4)
benchwise/metrics.py (1)
  • accuracy (446-557)
benchwise/datasets.py (3)
  • size (44-46)
  • prompts (49-62)
  • references (65-79)
benchwise/results.py (1)
  • success (37-39)
benchwise/core.py (1)
  • evaluate (30-79)
benchwise/core.py (4)
benchwise/types.py (1)
  • evaluate (320-324)
benchwise/datasets.py (1)
  • Dataset (11-210)
benchwise/results.py (1)
  • EvaluationResult (12-76)
benchwise/client.py (1)
  • upload_results (943-1013)
🔇 Additional comments (8)
tests/test_docs_examples.py (1)

20-172: LGTM! String quote standardization improves consistency.

The changes standardize string quotes to double quotes throughout the test file, improving code consistency. No functional changes detected.

tests/test_integration.py (1)

112-115: This concern is invalid — the project requires Python 3.11+, which supports parenthesized context managers.

The project's pyproject.toml specifies requires-python = ">=3.11" and targets Python 3.11+ (ruff config: target-version = "py311"). Parenthesized context managers are supported in Python 3.10+, so this syntax is compatible with the project's minimum Python version.

benchwise/logging.py (1)

71-80: LGTM! Explicit return type annotation added.

The -> None return type annotation correctly documents the function's behavior and improves type checker support.

benchwise/types.py (1)

285-323: LGTM! Well-defined protocols.

The SupportsGenerate, SupportsCache, and SupportsMetrics protocols are correctly structured with appropriate method signatures for structural subtyping.
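As an illustration of the structural subtyping these protocols rely on (a minimal sketch; the real signatures in benchwise/types.py differ):

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class SupportsGenerate(Protocol):
    """Hypothetical minimal shape of a generate protocol."""

    def generate(self, prompt: str, **kwargs: Any) -> str: ...


class EchoModel:
    def generate(self, prompt: str, **kwargs: Any) -> str:
        return prompt


# Structural subtyping: EchoModel never names the protocol, yet satisfies it,
# so this assignment type-checks and the runtime isinstance check passes.
model: SupportsGenerate = EchoModel()
```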

benchwise/core.py (4)

176-223: LGTM! Well-typed stress_test decorator.

The type annotations correctly capture the async decorator pattern and the Union[R, BaseException] return type properly reflects asyncio.gather(..., return_exceptions=True) behavior.


229-257: LGTM! Clear type annotations for EvaluationRunner.

The explicit typing for config, results_cache, and method signatures improves code clarity and enables better IDE support.


317-335: LGTM! Correct type annotations for quick_eval.

The Dict[str, Optional[float]] return type correctly captures that model evaluations may fail (returning None).


85-85: Likely an incorrect or invalid review comment.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 88c052d and 8fa061b.

📒 Files selected for processing (5)
  • .pre-commit-config.yaml (2 hunks)
  • benchwise/config.py (15 hunks)
  • benchwise/metrics.py (25 hunks)
  • tests/test_memory_large_datasets.py (8 hunks)
  • tests/test_results.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
benchwise/metrics.py (1)
benchwise/types.py (1)
  • RougeScores (56-72)
benchwise/config.py (1)
benchwise/types.py (1)
  • ConfigDict (204-219)
🪛 GitHub Actions: CI
benchwise/metrics.py

[error] 4-4: ModuleNotFoundError: No module named 'rouge_score' while importing benchwise.metrics during test execution.

🔇 Additional comments (13)
tests/test_results.py (1)

187-189: LGTM!

The reformatted assertion maintains the same logic while improving consistency with the test suite's formatting style. The parenthesized message format is valid and readable.

tests/test_memory_large_datasets.py (2)

32-34: Consistent assertion formatting applied throughout the file.

All assertions have been reformatted to use the parenthesized message style consistently. The memory threshold checks and test logic remain unchanged. This aligns with the PR's broader formatting standardization effort.


60-62: Formatting changes consistent with the rest of the file.

All remaining assertion reformats follow the same pattern—no logic changes, just stylistic consistency improvements.

Also applies to: 72-74, 103-105, 138-140, 167-169, 187-189, 201-203

.pre-commit-config.yaml (1)

20-28: LGTM! Mypy integration properly configured.

The mypy hook is well-configured with:

  • Appropriate stub packages (types-requests, pandas-stubs)
  • Custom config file reference
  • Restricted scope to benchwise/ directory

This aligns perfectly with the PR's objective to improve type annotations.

benchwise/metrics.py (5)

69-69: LGTM! Proper typing with RougeScores TypedDict.

The function signature correctly uses the RougeScores TypedDict from benchwise/types.py, and the result dictionary is properly typed and constructed with explicit float conversions for numpy values. This ensures type safety and mypy compatibility.

Also applies to: 140-149, 176-176


294-307: LGTM! Appropriate use of type: ignore for bound methods.

The type: ignore[no-any-return] comments correctly suppress mypy warnings for returning bound methods from the NLTK SmoothingFunction. This is necessary because these methods don't have proper type stubs.


1498-1503: LGTM! Proper typing for MetricCollection methods.

The method signatures correctly use:

  • -> None for __init__ and add_metric
  • Callable[..., Any] for metric functions
  • **kwargs: Any for flexible arguments

This provides good type safety while maintaining flexibility.


928-928: The project requires Python 3.11+ (requires-python = ">=3.11" in pyproject.toml), which fully supports PEP 585 lowercase generics. Using set[str] is the correct, idiomatic approach and no changes are needed.


1-11: Fix unconditional import of undeclared optional dependency rouge_score.

The ModuleNotFoundError: No module named 'rouge_score' occurs because line 4 imports rouge_scorer unconditionally, but rouge-score is not declared in any dependency configuration (neither install_requires nor any extras). Either add rouge-score to an optional dependency extra (e.g., metrics), or guard the import with a try-except block or conditional import since ROUGE metrics are optional functionality.

Likely an incorrect or invalid review comment.
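If the guarded-import route is taken, a common pattern looks like this (a sketch; `HAS_ROUGE` and `require_rouge` are illustrative names, not existing code):

```python
# Optional dependency guard: import succeeds when rouge-score is installed,
# otherwise the module still loads and errors only when ROUGE is used.
try:
    from rouge_score import rouge_scorer  # noqa: F401
    HAS_ROUGE = True
except ImportError:
    rouge_scorer = None  # type: ignore[assignment]
    HAS_ROUGE = False


def require_rouge() -> None:
    if not HAS_ROUGE:
        raise ImportError(
            "ROUGE metrics require the optional 'rouge-score' package: "
            "pip install rouge-score"
        )
```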

benchwise/config.py (4)

10-14: LGTM! Consistent typing with centralized type definitions.

The changes properly integrate ConfigDict from benchwise/types.py and use List[str] for model/metric lists. The to_dict() method now has a precise return type that aligns with the TypedDict definition, improving type safety throughout the codebase.

Also applies to: 57-58, 217-233


60-64: LGTM! Comprehensive None return type annotations.

All void methods now properly annotate -> None return types, which is essential for mypy strict mode and improves code documentation. This includes initialization, validation, configuration, and printing methods.

Also applies to: 66-117, 118-143, 145-172, 174-215, 235-242, 263-271, 320-323, 488-517


427-485: validate_api_keys performs live external API calls.

The validate_api_keys() function makes actual API requests to OpenAI, Anthropic, Google GenAI, and HuggingFace to verify keys. This can:

  • Consume API quota/credits
  • Add latency to configuration checks
  • Fail in offline or restricted network environments

Consider documenting this behavior clearly, or making the validation opt-in rather than automatic during configuration operations.

Based on learnings about user experience with API calls during configuration.


203-205: Good security practice: API key handling.

The explanatory note added when not saving the API key clearly informs users why the key is excluded and how to set it via environment variables. This balances security with user experience.

@yanurag-dev
Collaborator Author

@CodeRabbit is it fine to use cast()?

@coderabbitai

coderabbitai bot commented Dec 4, 2025

Tip

For best results, initiate chat on the files or code changes.

@yanurag-dev Using cast() is acceptable for your use case, but with important caveats:

When cast() is fine (like your PR):

Typing external API responses - You're using it to type JSON responses from HTTP calls where you know the expected structure from the API contract
Type checker compliance - It helps satisfy mypy's strict mode without overly complex type guards
Pragmatic approach - For an SDK/client library, it's a reasonable trade-off between type safety and development speed

Important considerations:

⚠️ No runtime validation - cast() is purely for type checkers. It doesn't verify the data structure at runtime:

# This will NOT fail at runtime even if response is wrong type
result = cast(Dict[str, Any], response.json())

⚠️ Trust-based - You're telling mypy "trust me, this is the type" - if the API changes, you won't get runtime errors, just incorrect types

Better alternatives (for consideration):

  1. Pydantic models (for production-grade validation):
from pydantic import BaseModel

class UserResponse(BaseModel):
    id: int
    username: str
    email: str

user = UserResponse(**response.json())  # Runtime validation!
  2. TypeGuard functions (for runtime checks):
from typing import TypeGuard

def is_user_dict(data: Any) -> TypeGuard[Dict[str, Any]]:
    return isinstance(data, dict) and "id" in data and "username" in data

Verdict for your PR:

Your usage is fine for this stage, especially since:

  • You're improving type coverage systematically
  • The API contract is presumably stable
  • You're working with Dict[str, Any] (flexible type)
  • It's better than having Any everywhere

Recommendation: Keep the cast() for now, but consider adding a TODO comment or follow-up issue to add runtime validation (Pydantic models) for production robustness if this library will be widely used.

🐰


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
benchwise/cli.py (1)

213-219: Verify key existence when accessing typed dictionary results.

The code accesses ["f1"] and ["mean_similarity"] from typed dictionaries. While the TypedDicts in benchwise/types.py define these fields with total=False, the actual implementations in benchwise/metrics.py always populate these fields. However, for robustness, consider using .get() with defaults or add explicit checks.

Apply this diff to use safer dictionary access:

                    elif metric_name == "rouge_l":
                        rouge_result = rouge_l(responses, references)
-                        results["rouge_l_f1"] = rouge_result["f1"]
+                        results["rouge_l_f1"] = rouge_result.get("f1", 0.0)
                    elif metric_name == "semantic_similarity":
                        semantic_result = semantic_similarity(responses, references)
-                        results["semantic_similarity"] = semantic_result[
-                            "mean_similarity"
-                        ]
+                        results["semantic_similarity"] = semantic_result.get(
+                            "mean_similarity", 0.0
+                        )
benchwise/client.py (1)

249-249: Consider adding runtime validation for API responses.

The code uses cast() extensively to type API responses. While this satisfies mypy and is acceptable for the current stage (as discussed in PR comments), cast() provides no runtime validation. If the API contract changes or returns unexpected data, errors will only surface when accessing specific fields.

Based on learnings, cast() is fine for this use case given the stable API contract, but consider adding a TODO or follow-up issue to implement runtime validation using Pydantic models for production robustness.

Example with Pydantic (for future consideration):

from pydantic import BaseModel

class UserResponse(BaseModel):
    id: int
    username: str
    email: str
    # ... other fields

async def get_current_user(self) -> Dict[str, Any]:
    response = await self.client.get("/api/v1/users/me")
    if response.status_code == 200:
        # Validates at runtime and raises ValidationError if invalid
        user = UserResponse.model_validate(response.json())
        return user.model_dump()
    # ... error handling

Also applies to: 276-276, 348-350, 375-384

benchwise/metrics.py (1)

150-177: Consider consistent typing pattern for result construction.

The code uses two patterns for building typed results:

  1. Lines 150-177: Explicitly typed variable result: RougeScores = {...}
  2. Lines 279-302: Build Dict[str, Any] then cast at return

Both patterns work, but pattern 1 provides slightly better type safety during dictionary construction (mypy can catch typos in keys). Consider using pattern 1 consistently for better type checking.

Example refactor for consistency:

-    result_dict: Dict[str, Any] = {
+    result: BleuScores = {
         "corpus_bleu": corpus_bleu,
         "sentence_bleu": float(np.mean(sentence_scores)),
         # ... other fields
     }
     
-    return cast(BleuScores, result_dict)
+    return result

Also applies to: 279-302
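The difference between the two patterns is easy to see with a minimal TypedDict (a hypothetical field subset, not the real RougeScores definition):

```python
from typing import TypedDict


class Scores(TypedDict):
    f1: float
    precision: float


# Pattern 1: the target type is declared at the assignment site, so a type
# checker flags a misspelled key (e.g. "precison") right here. With the
# cast() pattern the same typo sails through unchecked.
result: Scores = {"f1": 0.9, "precision": 0.8}
```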

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8fa061b and e595050.

📒 Files selected for processing (6)
  • benchwise/cli.py (10 hunks)
  • benchwise/client.py (23 hunks)
  • benchwise/core.py (12 hunks)
  • benchwise/metrics.py (34 hunks)
  • benchwise/types.py (1 hunks)
  • mypy.ini (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
benchwise/types.py (4)
benchwise/metrics.py (2)
  • accuracy (463-574)
  • evaluate (1527-1543)
benchwise/datasets.py (3)
  • size (44-46)
  • prompts (49-62)
  • references (65-79)
benchwise/results.py (1)
  • success (37-39)
benchwise/core.py (1)
  • evaluate (30-79)
benchwise/cli.py (2)
benchwise/types.py (4)
  • ConfigureArgs (449-456)
  • ConfigKwargs (472-477)
  • SyncArgs (459-462)
  • StatusArgs (465-469)
benchwise/metrics.py (2)
  • rouge_l (73-177)
  • semantic_similarity (577-703)
benchwise/core.py (4)
benchwise/types.py (1)
  • evaluate (442-446)
benchwise/datasets.py (1)
  • Dataset (11-210)
benchwise/results.py (1)
  • EvaluationResult (12-76)
benchwise/cli.py (1)
  • run_evaluation (115-286)
benchwise/metrics.py (2)
benchwise/types.py (10)
  • RougeScores (56-72)
  • BleuScores (75-96)
  • BertScoreResults (99-129)
  • AccuracyResults (132-154)
  • SemanticSimilarityResults (157-182)
  • PerplexityResults (185-193)
  • FactualCorrectnessResults (263-281)
  • CoherenceResults (204-221)
  • SafetyResults (232-252)
  • evaluate (442-446)
benchwise/core.py (1)
  • evaluate (30-79)
benchwise/client.py (1)
benchwise/types.py (1)
  • OfflineQueueItem (480-484)
🪛 GitHub Actions: CI
benchwise/metrics.py

[error] 14-14: ModuleNotFoundError: No module named 'rouge_score'. PyTest run failed while importing rouge_scorer from rouge_score during test collection. Command: python -m pytest -v tests/test_core.py tests/test_datasets.py tests/test_models.py tests/test_results.py tests/test_config.py --tb=short --color=yes --durations=5 --maxfail=10 --disable-warnings

🔇 Additional comments (8)
mypy.ini (1)

1-73: LGTM! Comprehensive mypy configuration.

The configuration enables strict type checking appropriately and handles third-party library stubs correctly. The use of follow_imports = skip for large libraries like transformers and google.generativeai is a good practice to avoid performance issues.

benchwise/cli.py (1)

289-323: LGTM! Protocol-based typing for CLI arguments.

The use of Protocol types (ConfigureArgs, ConfigKwargs, etc.) is an excellent pattern for typing argparse Namespace objects and function kwargs. This provides type safety without requiring concrete class dependencies.

benchwise/types.py (1)

1-484: LGTM! Well-structured centralized type definitions.

This module provides excellent type coverage with:

  • Appropriate use of TypedDict with total=False for optional fields
  • Well-documented Protocol interfaces for duck typing
  • Comprehensive metric result types matching the actual implementations
  • Clear separation of concerns (models, metrics, datasets, config, API)

The centralized types module will significantly improve IDE support and catch type errors early.
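The total=False pattern mentioned above can be sketched as follows; the field names are illustrative, not the actual RougeScores keys:

```python
from typing import TypedDict


class RougeScoresExample(TypedDict, total=False):
    # total=False makes every key optional, so partial results type-check
    rouge_l_f1: float
    rouge_l_precision: float
    rouge_l_recall: float


# Valid despite missing keys: all fields are optional.
partial: RougeScoresExample = {"rouge_l_f1": 0.42}
full: RougeScoresExample = {
    "rouge_l_f1": 0.42,
    "rouge_l_precision": 0.40,
    "rouge_l_recall": 0.44,
}

# Absent optional keys are read defensively with .get()
print(partial.get("rouge_l_precision", 0.0))  # → 0.0
```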

benchwise/client.py (1)

77-93: LGTM! Properly typed async context manager.

The context manager implementation correctly types the __aenter__ and __aexit__ methods with proper exception handling types. The typing improvements make the async context manager protocol explicit and checkable by mypy.
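A minimal sketch of the fully typed __aenter__/__aexit__ signatures, independent of the actual client class:

```python
import asyncio
from types import TracebackType
from typing import Optional, Type


class TypedClient:
    """Sketch of the async context manager protocol with explicit types."""

    async def __aenter__(self) -> "TypedClient":
        self.connected = True
        return self

    async def __aexit__(
        self,
        exc_type: Optional[Type[BaseException]],
        exc: Optional[BaseException],
        tb: Optional[TracebackType],
    ) -> None:
        # Returning None (rather than bool) means exceptions propagate
        self.connected = False


async def main() -> bool:
    async with TypedClient() as client:
        return client.connected


print(asyncio.run(main()))  # → True
```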

benchwise/metrics.py (1)

151-158: LGTM! Proper float conversions for numpy scalars.

The explicit float() conversions for numpy aggregations (mean, std, percentile) are correct and necessary. NumPy returns numpy scalar types (e.g., np.float64) which don't satisfy type checkers expecting float. This pattern ensures type compatibility.
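The conversion can be reproduced in isolation (assuming numpy is available):

```python
import numpy as np

scores = [0.8, 0.6, 0.9]
raw_mean = np.mean(scores)                   # np.float64, not a builtin float
typed_mean: float = float(np.mean(scores))   # satisfies a declared `float` type

print(type(raw_mean).__name__, type(typed_mean).__name__)  # → float64 float
```

Note that np.float64 subclasses float at runtime, so the issue is purely static: annotations and TypedDict fields declared as float are satisfied cleanly once the explicit conversion is applied.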

benchwise/core.py (3)

30-79: LGTM! Advanced decorator typing with ParamSpec and Awaitable.

The decorator typing correctly captures the complex nested function signatures:

  • evaluate() returns a decorator function
  • The decorator accepts an async function and returns a wrapper
  • The wrapper preserves the async nature and returns typed results

The type: ignore comment on line 75 is appropriate for the dynamically added _benchmark_metadata attribute.

This is excellent use of Python's advanced typing features to provide strong type safety for decorator patterns.


176-223: LGTM! Correct typing for exception handling in stress tests.

The List[Union[R, BaseException]] return type correctly captures that stress test results can contain either successful results or exceptions. This aligns perfectly with asyncio.gather(..., return_exceptions=True) which returns a mix of results and exceptions.

The typing makes it explicit that callers must check each result to determine if it's an exception or a successful value.
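A self-contained sketch of the mixed-result behavior of asyncio.gather(..., return_exceptions=True) that the annotation describes:

```python
import asyncio
from typing import List, Union


async def task(n: int) -> int:
    if n == 2:
        raise ValueError("boom")
    return n * 10


async def run_all() -> List[Union[int, BaseException]]:
    # return_exceptions=True keeps exceptions in-place instead of raising
    return await asyncio.gather(
        *(task(n) for n in range(3)), return_exceptions=True
    )


results = asyncio.run(run_all())
for r in results:
    if isinstance(r, BaseException):
        print("failed:", r)
    else:
        print("ok:", r)
```

The isinstance check is exactly the per-result inspection the review says callers must perform.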


317-334: LGTM! Clean typing for quick evaluation.

The function signature correctly types:

  • metric parameter as a callable that scores text: Callable[[str], float]
  • Return type as Dict[str, Optional[float]] to handle failures gracefully

This provides clear contracts for callers and enables proper type checking.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
benchwise/core.py (1)

74-75: Consider using a Protocol for dynamic attributes.

The type: ignore comment is necessary here because _benchmark_metadata is dynamically attached by the @benchmark decorator. For improved type safety, consider defining a Protocol:

from typing import Protocol

class BenchmarkFunction(Protocol):
    _benchmark_metadata: Dict[str, Any]

Then use conditional checks with isinstance() or hasattr() paired with proper type guards. This would eliminate the need for type: ignore comments while maintaining runtime flexibility.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e595050 and 420e169.

📒 Files selected for processing (1)
  • benchwise/core.py (12 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
benchwise/core.py (4)
benchwise/datasets.py (1)
  • Dataset (11-210)
benchwise/results.py (1)
  • EvaluationResult (12-76)
benchwise/models.py (1)
  • get_model_adapter (275-298)
benchwise/client.py (1)
  • upload_results (950-1020)
🔇 Additional comments (10)
benchwise/core.py (10)

1-25: LGTM! Type variable declarations are appropriate.

The addition of ParamSpec and TypeVar enables proper generic typing for decorators, preserving both parameter signatures and return types through decorator chains.


30-35: LGTM! Decorator signature accurately represents the evaluation pattern.

The complex generic return type correctly captures the decorator chain: a function that accepts an async test function and returns a wrapper that takes a Dataset and produces evaluation results.


82-90: LGTM! Function signature accurately reflects parameter types.

The use of tuple[str, ...] for the models parameter correctly represents that it originates from *models (variadic args), which creates a tuple. The addition of the wrapper_func parameter enables metadata access from the decorator chain.


154-174: LGTM! Generic signature preserves wrapped function's type.

The use of ParamSpec[P] and TypeVar[R] ensures that the @benchmark decorator doesn't alter the wrapped function's signature, enabling proper type checking for decorated functions. The type: ignore[attr-defined] on line 167 is consistent with the pattern on line 75.


177-224: LGTM! Generic signature correctly captures stress test behavior.

The return type List[Union[R, BaseException]] accurately represents the behavior of asyncio.gather(..., return_exceptions=True), which returns both successful results and exceptions. The use of ParamSpec preserves the wrapped function's parameter signature.


230-233: LGTM! Explicit attribute annotations improve type safety.

The explicit type annotations for self.config and self.results_cache, along with the -> None return type for __init__, enhance type checking. Using Dict[str, Any] is appropriate for flexible configuration storage.


235-254: LGTM! Method signature clearly defines expectations.

The explicit typing of test_func as Callable[..., Awaitable[Any]] and the annotated results list provide clear type information for this evaluation method.


256-307: LGTM! Optional parameter correctly typed and defensively handled.

The change to Optional[str] = None for metric_name is correct. The score extraction logic defensively handles missing or None scores by falling back to 0, ensuring robustness when result structures vary.


310-315: LGTM! Signature consistent with other async evaluation functions.

The typing of benchmark_func as Callable[..., Awaitable[Any]] maintains consistency with the evaluate and run_evaluation methods.


318-336: LGTM! Function signature and error handling properly typed.

The metric parameter typed as Callable[[str], float] clearly defines expectations, and the return type Dict[str, Optional[float]] correctly represents that evaluation can fail (None) for some models.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
.github/workflows/ci.yml (2)

19-19: Verify Python version support scope.

The workflow only tests against Python 3.12. If the project is intended to support multiple Python versions (e.g., 3.9, 3.10, 3.11), consider adding a matrix strategy to catch version-specific compatibility issues early.

If multi-version support is needed, you can refactor the Python setup step to use a matrix:

  jobs:
    test:
      runs-on: ubuntu-latest
+     strategy:
+       matrix:
+         python-version: ["3.9", "3.10", "3.11", "3.12"]
  
      steps:
        - uses: actions/checkout@v4
  
        - name: Set up Python
          uses: actions/setup-python@v5
          with:
-           python-version: "3.12"
+           python-version: ${{ matrix.python-version }}



21-24: Consider adding pip dependency caching for faster builds.

Caching pip dependencies can significantly reduce build times on subsequent runs. This is an optional performance optimization that works well for CI pipelines with frequent pushes.

Add pip caching before the install step:

      - uses: actions/checkout@v4

+     - name: Cache pip dependencies
+       uses: actions/cache@v4
+       with:
+         path: ~/.cache/pip
+         key: ${{ runner.os }}-pip-${{ hashFiles('**/pyproject.toml') }}
+         restore-keys: |
+           ${{ runner.os }}-pip-

      - name: Set up Python 3.12
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 420e169 and dcf430e.

📒 Files selected for processing (2)
  • .github/workflows/ci.yml (1 hunks)
  • tests/test_models.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_models.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test
🔇 Additional comments (1)
.github/workflows/ci.yml (1)

26-36: Well-structured CI workflow with comprehensive checks.

The workflow covers all essential quality gates: linting, formatting, type checking, and testing. It aligns well with the PR's focus on improving type annotations and introducing strict mypy configuration. The sequential execution is appropriate for this pipeline. All supporting infrastructure is properly configured: mypy.ini contains the benchwise target, pyproject.toml defines the required optional dependencies, and run_tests.py is available with --basic flag support.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (7)
benchwise/client.py (3)

816-871: Offline sync is currently broken (wrong reconstruction + wrong kwargs).
Two runtime issues in sync_offline_queue():

  1. BenchmarkResult(**benchmark_result_dict) won’t rebuild results into EvaluationResult objects (they’re dicts from to_dict()), so later result.success / result.to_dict() will crash.
  2. The queued "create_evaluation" payload includes "config", but create_evaluation() doesn’t accept it—self.create_evaluation(**evaluation_data) will raise TypeError.
         for item in self.offline_queue:
             try:
                 queue_data: Dict[str, Any] = item["data"]
                 data_type: Any = queue_data.get("type")
@@
                 if data_type == "full_benchmark_result":
-                    # Reconstruct BenchmarkResult and upload
-                    from .results import BenchmarkResult
-
-                    benchmark_result_dict: Dict[str, Any] = queue_data.get(
-                        "benchmark_result", {}
-                    )
-                    benchmark_result = BenchmarkResult(**benchmark_result_dict)
+                    # Reconstruct BenchmarkResult + EvaluationResult objects and upload
+                    from .results import BenchmarkResult, EvaluationResult
+                    from datetime import datetime
+
+                    benchmark_result_dict: Dict[str, Any] = cast(
+                        Dict[str, Any], queue_data.get("benchmark_result", {})
+                    )
+                    raw_results: Any = benchmark_result_dict.get("results", [])
+                    results_objs: List[EvaluationResult] = []
+                    if isinstance(raw_results, list):
+                        for r in raw_results:
+                            if isinstance(r, dict):
+                                ts = r.get("timestamp")
+                                if isinstance(ts, str):
+                                    try:
+                                        r["timestamp"] = datetime.fromisoformat(ts)
+                                    except Exception:
+                                        pass
+                                # drop computed fields
+                                r_clean = {k: v for k, v in r.items() if k not in ("success", "failed")}
+                                results_objs.append(EvaluationResult(**r_clean))
+
+                    ts_bm = benchmark_result_dict.get("timestamp")
+                    bm_timestamp = datetime.now()
+                    if isinstance(ts_bm, str):
+                        try:
+                            bm_timestamp = datetime.fromisoformat(ts_bm)
+                        except Exception:
+                            pass
+
+                    benchmark_result = BenchmarkResult(
+                        benchmark_name=str(benchmark_result_dict.get("benchmark_name", "offline_benchmark")),
+                        results=results_objs,
+                        metadata=cast(Dict[str, Any], benchmark_result_dict.get("metadata", {})),
+                        timestamp=bm_timestamp,
+                    )
                     await self.upload_benchmark_result(benchmark_result)
                 elif data_type == "create_evaluation":
-                    evaluation_data: Dict[str, Any] = queue_data.get("data", {})
-                    await self.create_evaluation(**evaluation_data)
+                    evaluation_data: Dict[str, Any] = cast(Dict[str, Any], queue_data.get("data", {}))
+                    name = evaluation_data.get("name")
+                    benchmark_id = evaluation_data.get("benchmark_id")
+                    model_ids = evaluation_data.get("model_ids")
+                    metadata = evaluation_data.get("metadata")
+                    if isinstance(name, str) and isinstance(benchmark_id, int) and isinstance(model_ids, list):
+                        await self.create_evaluation(
+                            name=name,
+                            benchmark_id=benchmark_id,
+                            model_ids=cast(List[int], model_ids),
+                            metadata=cast(Optional[EvaluationMetadata], metadata),
+                        )

196-226: Avoid logging raw usernames/emails on login attempts.
logger.info(f"Attempting login for user: {username}") can leak PII into logs; consider redacting or logging at debug with a hash/truncated value.
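One way to implement the suggested redaction, as a sketch rather than the project's code:

```python
import hashlib


def redact_user(identifier: str) -> str:
    """Return a short, stable hash so logs can correlate a user's
    requests without storing the raw username/email."""
    digest = hashlib.sha256(identifier.encode()).hexdigest()[:8]
    return f"user:{digest}"


# e.g. logger.debug("Attempting login for %s", redact_user(username))
print(redact_user("alice@example.com"))
```

The truncated digest is enough to correlate log lines for one user while keeping the raw identifier out of log storage.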


109-174: Handle HTTP-date retry-after format and validate headers dict before mutation.

Line 135–137: int(response.headers.get("retry-after", ...)) will crash if a server sends HTTP-date format (e.g., "Fri, 31 Dec 1999 23:59:59 GMT"), which is valid per RFC 7231. The check at line 118 is insufficient—if a caller passes headers=None, line 120 will crash attempting to subscript None.
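A sketch of a parser that accepts both Retry-After formats from RFC 7231; the 1.0-second fallback is an assumed default, not the project's value:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str]) -> float:
    """Return a wait time in seconds for either Retry-After format."""
    if not value:
        return 1.0  # assumed default backoff
    try:
        return float(int(value))  # delta-seconds form, e.g. "120"
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
        return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())
    except (TypeError, ValueError):
        return 1.0


print(parse_retry_after("120"))
print(parse_retry_after("Fri, 31 Dec 1999 23:59:59 GMT"))  # past date → 0.0
```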

benchwise/results.py (2)

180-228: Ensure compare_models() only ranks numeric scores (can TypeError today).
valid_scores currently accepts any non-None score; if a metric returns a dict/string, sorted(..., key=lambda x: x[1]) and np.mean() will break.

-        valid_scores = [
-            (name, score)
-            for name, score in zip(model_names, scores)
-            if score is not None
-        ]
+        valid_scores: List[tuple[str, float]] = []
+        for name, score in zip(model_names, scores):
+            if isinstance(score, (int, float)):
+                valid_scores.append((name, float(score)))
@@
-        sorted_results = sorted(valid_scores, key=lambda x: x[1], reverse=True)
+        sorted_results = sorted(valid_scores, key=lambda x: x[1], reverse=True)

294-342: Make cross-benchmark output JSON-friendly (timestamp type mismatch).
benchmark_info["timestamp"] = benchmark.timestamp is a datetime, but your typed shape suggests a str timestamp (and you already use isoformat() in to_dict()).

             benchmark_info = {
                 "name": benchmark.benchmark_name,
-                "timestamp": benchmark.timestamp,
+                "timestamp": benchmark.timestamp.isoformat(),
                 "models": benchmark.model_names,
                 "success_rate": benchmark.success_rate,
             }
benchwise/core.py (1)

268-288: EvaluationRunner.run_evaluation() returns the wrong thing (likely breaks run_benchmark).
run_evaluation() promises List[EvaluationResult] but appends the raw result from test_func(model, dataset) (which is typically metrics/Any). This is a runtime bug if run_benchmark() is used.

     async def run_evaluation(
@@
     ) -> List[EvaluationResult]:
         """Run evaluation on multiple models."""
-        results: List[EvaluationResult] = []
+        results: List[EvaluationResult] = []
@@
         for model_name in models:
             try:
                 model = get_model_adapter(model_name)
-                result = await test_func(model, dataset)
-                results.append(result)
+                start_time = time.time()
+                payload = await test_func(model, dataset)
+                end_time = time.time()
+                results.append(
+                    EvaluationResult(
+                        model_name=model_name,
+                        test_name=getattr(test_func, "__name__", "evaluation"),
+                        result=payload,
+                        duration=end_time - start_time,
+                        dataset_info=convert_metadata_to_info(dataset.metadata)
+                        if dataset.metadata
+                        else None,
+                    )
+                )
             except Exception as e:
                 self.logger.error(f"Evaluation failed for {model_name}: {e}")
+                results.append(
+                    EvaluationResult(
+                        model_name=model_name,
+                        test_name=getattr(test_func, "__name__", "evaluation"),
+                        error=str(e),
+                        duration=0.0,
+                        dataset_info=convert_metadata_to_info(dataset.metadata)
+                        if dataset.metadata
+                        else None,
+                    )
+                )

Also applies to: 354-360

benchwise/datasets.py (1)

334-448: Fix URL handling order + add request timeouts (currently breaks remote URLs).

When source is a string URL like "https://host/data.json", it passes the isinstance(source, (str, Path)) check, and Path(source).suffix returns ".json", so the local .json file branch executes first. That branch attempts open("https://host/data.json", "r"), which fails before the URL branch at line 399 ever runs. Additionally, the requests.get() calls at line 401 lack a timeout parameter, risking indefinite hangs.

Move URL detection before the local file suffix checks, and add timeout=30 (or appropriate value) to requests.get() calls.
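The suggested ordering can be sketched as a small source classifier; the branch names and the timeout=30 value are illustrative:

```python
from pathlib import Path
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Decide the loading branch. URL detection must run before the
    suffix checks, otherwise "https://host/data.json" is misrouted
    to the local-file branch because its suffix is ".json"."""
    if urlparse(source).scheme in ("http", "https"):
        # remote branch: e.g. requests.get(source, timeout=30)
        # — always pass a timeout to avoid indefinite hangs
        return "url"
    suffix = Path(source).suffix.lower()
    if suffix == ".json":
        return "local_json"
    if suffix == ".csv":
        return "local_csv"
    return "unknown"


print(classify_source("https://host/data.json"))  # → url
print(classify_source("data/train.json"))         # → local_json
```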

🧹 Nitpick comments (2)
docs/test_load_dataset.py (1)

22-42: LGTM! The __main__ block successfully makes this test runnable.

The addition allows developers to execute the test directly, which improves the demo/documentation experience. The approach of creating a dummy data file on-the-fly is appropriate for example code.

One optional improvement: consider using json.dump() instead of a raw string for better maintainability:

import json

if __name__ == "__main__":
    if not os.path.exists(data_file_path):
        dummy_data = [
            {"prompt": "What is the capital of France?", "reference": "Paris"},
            {"prompt": "Who wrote 'Hamlet'?", "reference": "William Shakespeare"}
        ]
        with open(data_file_path, "w") as f:
            json.dump(dummy_data, f, indent=2)
        print(f"Created dummy {data_file_path} for testing.")
    
    test_load_dataset_from_json()

However, the current raw string approach is perfectly acceptable for documentation purposes and may even be more readable since the exact JSON structure is visible.

benchwise/datasets.py (1)

136-140: Prefer SHA-256 over MD5 for dataset hashes (even if non-crypto).
MD5 is collision-prone; using SHA-256 is basically free here and avoids raising eyebrows for downstream consumers.

-        return hashlib.md5(data_str.encode()).hexdigest()
+        return hashlib.sha256(data_str.encode()).hexdigest()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a3211ea and 1da265b.

📒 Files selected for processing (10)
  • .github/workflows/ci.yml (1 hunks)
  • .gitignore (1 hunks)
  • benchwise/cli.py (14 hunks)
  • benchwise/client.py (35 hunks)
  • benchwise/core.py (10 hunks)
  • benchwise/datasets.py (14 hunks)
  • benchwise/results.py (20 hunks)
  • benchwise/types.py (1 hunks)
  • demo.py (4 hunks)
  • docs/test_load_dataset.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • .gitignore
  • benchwise/types.py
🧰 Additional context used
🧬 Code graph analysis (4)
demo.py (3)
benchwise/core.py (2)
  • evaluate (39-95)
  • benchmark (185-207)
benchwise/datasets.py (1)
  • create_qa_dataset (451-495)
benchwise/metrics.py (2)
  • accuracy (463-574)
  • semantic_similarity (577-703)
benchwise/cli.py (2)
benchwise/datasets.py (3)
  • convert_metadata_to_info (714-761)
  • references (163-177)
  • size (142-144)
benchwise/types.py (5)
  • ConfigKwargs (641-646)
  • SyncArgs (628-631)
  • StatusArgs (634-638)
  • DatasetInfo (326-338)
  • EvaluationMetadata (378-393)
benchwise/results.py (1)
benchwise/types.py (10)
  • DatasetInfo (326-338)
  • EvaluationMetadata (378-393)
  • EvaluationResultDict (396-407)
  • BenchmarkSummary (410-416)
  • BenchmarkResultDict (419-426)
  • ModelComparisonResult (436-447)
  • CrossBenchmarkComparison (694-699)
  • ModelPerformanceAnalysis (702-716)
  • CachedResultInfo (675-682)
  • BenchmarkComparisonInfo (685-691)
benchwise/core.py (3)
benchwise/datasets.py (3)
  • Dataset (106-331)
  • convert_metadata_to_info (714-761)
  • size (142-144)
benchwise/results.py (2)
  • EvaluationResult (25-94)
  • success (52-54)
benchwise/types.py (7)
  • RunnerConfig (656-664)
  • ModelComparisonResult (436-447)
  • EvaluationResultDict (396-407)
  • EvaluationMetadata (378-393)
  • DatasetInfo (326-338)
  • CallableWithBenchmarkMetadata (608-615)
  • evaluate (594-598)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test
🔇 Additional comments (4)
.github/workflows/ci.yml (1)

3-8: CI looks solid; verify repo paths/extras and consider pip caching.

  • Please confirm mypy.ini is at repo root (since --config-file=mypy.ini) and run_tests.py is in the working directory the workflow runs from.
  • If your tests live under benchwise/tests (or another path), ruff check benchwise tests / ruff format --check benchwise tests may miss or fail—worth confirming the directory names.
  • Optional: add actions/cache (pip) to speed up repeated runs.

Also applies to: 21-36

demo.py (1)

30-42: Demo usage matches the @evaluate wrapper contract.
await test_general_knowledge(qa_dataset) returning an iterable of per-model results is consistent with the decorator behavior, and printing result.result[...] aligns with the EvaluationResult shape.

Also applies to: 45-57

benchwise/cli.py (1)

115-155: DatasetInfo conversion + upload fallback are clean and consistent.
convert_metadata_to_info(dataset.metadata) usage (with None fallback) keeps evaluation results and uploads aligned with the new typed DatasetInfo model.

Also applies to: 243-309

benchwise/core.py (1)

39-183: evaluate() + _run_evaluation() integration looks consistent (metadata + DatasetInfo).
The CallableWithBenchmarkMetadata cast approach avoids the old attr-defined issue while keeping runtime behavior straightforward, and the DatasetInfo fallback for uploads is sensible.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
benchwise/datasets.py (2)

122-134: Consider simplifying the metadata initialization.

The nested if checks can be consolidated into a single block. The current approach casts an empty dict, then immediately checks if it's empty and casts again with defaults.

Apply this diff to simplify:

 def __post_init__(self) -> None:
-    if self.metadata is None:
-        self.metadata = cast(DatasetMetadata, {})
-
-    if not self.metadata:
-        self.metadata = cast(
-            DatasetMetadata,
-            {
-                "size": len(self.data),
-                "created_at": pd.Timestamp.now().isoformat(),
-                "hash": self._compute_hash(),
-            },
-        )
+    if not self.metadata:
+        self.metadata = cast(
+            DatasetMetadata,
+            {
+                "size": len(self.data),
+                "created_at": pd.Timestamp.now().isoformat(),
+                "hash": self._compute_hash(),
+            },
+        )

458-502: Consistent and well-structured dataset creation functions.

All three creation functions follow a uniform pattern with proper type annotations, explicit schemas, and appropriate metadata. The use of cast() aligns with the PR's typing improvements.

Minor note: The name extraction pattern (checking isinstance(kwargs.get("name"), str)) is repeated in all three functions. If you add more creation functions in the future, consider extracting this into a helper, but it's fine as-is for now.

Also applies to: 505-549, 552-597

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1da265b and 1f7352b.

📒 Files selected for processing (1)
  • benchwise/datasets.py (14 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
benchwise/datasets.py (1)
benchwise/types.py (5)
  • DatasetItem (285-301)
  • DatasetMetadata (304-313)
  • DatasetSchema (316-323)
  • DatasetDict (349-355)
  • DatasetInfo (326-338)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test
🔇 Additional comments (6)
benchwise/datasets.py (6)

346-367: LGTM—DatasetDict name field is now respected.

The implementation now properly extracts and validates the name from the incoming DatasetDict before falling back to kwargs, addressing the previous review feedback.


362-367: Good use of centralized validation helpers.

Consistently applying the validation functions across all loading paths ensures uniform type checking and error messages.


721-768: LGTM—pragmatic approach to handling TypedDict limitations.

The cast to Dict[str, Any] at line 746 is a reasonable workaround for accessing fields like "hash", "task", and "difficulty" that may exist at runtime but aren't defined in the DatasetMetadata TypedDict. This function properly handles missing fields and provides proper type safety for the returned DatasetInfo.


321-329: Type narrowing approach is sound.

The explicit filtering at lines 322-324 helps mypy understand that numeric_values contains only int or float types, even though the all() check at line 320 already verified this. This is a reasonable tradeoff for strict type checking.


603-620: LGTM—proper return type annotations.

The DatasetRegistry methods now have explicit return type annotations, completing the type coverage for this class.


627-718: LGTM—consistent type annotations in sample loaders.

All three sample dataset functions now have proper type annotations with cast() calls, maintaining consistency with the rest of the module's typing improvements.

Comment on lines +18 to +102
def _validate_dataset_item(item: Any) -> DatasetItem:
"""
Validate and convert a dictionary to DatasetItem.

Args:
item: Dictionary or any value to validate

Returns:
Validated DatasetItem

Raises:
ValueError: If item is not a dictionary
"""
if not isinstance(item, dict):
raise ValueError(f"Expected dict for DatasetItem, got {type(item).__name__}")
return cast(DatasetItem, item)


def _validate_dataset_items(items: Any) -> List[DatasetItem]:
"""
Validate and convert a list of dictionaries to List[DatasetItem].

Args:
items: List of dictionaries or any value to validate

Returns:
Validated List[DatasetItem]

Raises:
ValueError: If items is not a list or contains non-dict items
"""
if not isinstance(items, list):
raise ValueError(f"Expected list for dataset data, got {type(items).__name__}")

validated_items: List[DatasetItem] = []
for i, item in enumerate(items):
if not isinstance(item, dict):
raise ValueError(
f"Expected dict for dataset item at index {i}, got {type(item).__name__}"
)
validated_items.append(cast(DatasetItem, item))

return validated_items


def _validate_dataset_metadata(metadata: Any) -> Optional[DatasetMetadata]:
"""
Validate and convert metadata to DatasetMetadata.

Args:
metadata: Dictionary or None to validate

Returns:
Validated DatasetMetadata or None
"""
if metadata is None:
return None

if not isinstance(metadata, dict):
raise ValueError(
f"Expected dict or None for DatasetMetadata, got {type(metadata).__name__}"
)

return cast(DatasetMetadata, metadata)


def _validate_dataset_schema(schema: Any) -> Optional[DatasetSchema]:
"""
Validate and convert schema to DatasetSchema.

Args:
schema: Dictionary or None to validate

Returns:
Validated DatasetSchema or None
"""
if schema is None:
return None

if not isinstance(schema, dict):
raise ValueError(
f"Expected dict or None for DatasetSchema, got {type(schema).__name__}"
)

return cast(DatasetSchema, schema)


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's understand the file structure and imports
head -20 benchwise/datasets.py

Repository: Benchwise/benchwise

Length of output: 485


🏁 Script executed:

# Search for where these validation functions are called
rg "_validate_dataset" --type python -A 3 -B 1

Repository: Benchwise/benchwise

Length of output: 149


🏁 Script executed:

# Look at the TypedDict definitions to understand what structure is expected
rg "TypedDict|DatasetItem|DatasetMetadata|DatasetSchema" --type python -B 2 -A 5

Repository: Benchwise/benchwise

Length of output: 149


🏁 Script executed:

# Check if there are any other validation patterns in the file
rg "isinstance|validate|check" benchwise/datasets.py -A 2

Repository: Benchwise/benchwise

Length of output: 5607


🏁 Script executed:

# Look at the PR discussion context - search for cast usage and related comments
rg "cast\(" benchwise/datasets.py -B 2 -A 2

Repository: Benchwise/benchwise

Length of output: 4970


🏁 Script executed:

# Check the types module to see TypedDict definitions
find . -name "types.py" -type f | head -5

Repository: Benchwise/benchwise

Length of output: 83


🏁 Script executed:

# Look at the types definitions
cat benchwise/types.py | head -100

Repository: Benchwise/benchwise

Length of output: 2413


🏁 Script executed:

# Check what keys are actually accessed in Dataset methods
rg "self\.data\[|self\.metadata\[|self\.schema\[" benchwise/datasets.py -A 2

Repository: Benchwise/benchwise

Length of output: 414


🏁 Script executed:

# Check if there are required fields in the TypedDict definitions
rg "class DatasetItem|class DatasetMetadata|class DatasetSchema|TypedDict" benchwise/types.py -A 10

Repository: Benchwise/benchwise

Length of output: 11637


🏁 Script executed:

# Look at the validate_schema method and how it accesses schema keys
rg -A 20 "def validate_schema" benchwise/datasets.py

Repository: Benchwise/benchwise

Length of output: 788


🏁 Script executed:

# Check other methods that access metadata or schema keys
rg "self\.schema\[|self\.metadata\[" benchwise/datasets.py -B 2 -A 3

Repository: Benchwise/benchwise

Length of output: 394


🏁 Script executed:

# Look at the stats method that accesses data[0].keys()
rg -A 30 "def get_stats" benchwise/datasets.py

Repository: Benchwise/benchwise

Length of output: 45


Consider adding structural validation for safety-critical scenarios.

The validation functions only verify that inputs are dicts or lists, then use cast() without checking field structure. This is intentional, since the TypedDict definitions use total=False (all fields optional), but it forces methods such as validate_schema() that access specific keys to repeat field existence checks. For data loaded from external sources (files, remote APIs), consider adding structural validation, either by checking for required fields in the validation functions or by adopting a schema validation library such as Pydantic, to catch malformed data early and provide clearer error messages.

🤖 Prompt for AI Agents
In benchwise/datasets.py around lines 18-102, the helper validators only check
top-level types and then cast, which lets malformed or missing fields slip
through. Update each validator to perform structural checks for the fields the
code later relies on (required keys and their expected primitive types),
returning the validated dict only when the required keys exist and their types
match, and raising descriptive ValueError messages otherwise. Alternatively,
replace these simple functions with lightweight Pydantic models (or similar
schema validators) for DatasetItem, DatasetMetadata, and DatasetSchema, and use
the model's validation entry point (parse_obj in Pydantic v1, model_validate in
v2) to validate external data and produce clear errors.
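A validator along these lines can check required keys before casting. Here is a minimal stdlib-only sketch; the field names prompt and reference are hypothetical, since the actual DatasetItem fields in benchwise/types.py are not shown here:

```python
from typing import Any, TypedDict, cast


class DatasetItem(TypedDict, total=False):
    # Hypothetical fields; the real DatasetItem in benchwise/types.py may differ.
    prompt: str
    reference: str


# Keys the rest of the code relies on, with their expected primitive types.
REQUIRED_ITEM_KEYS: dict[str, type] = {"prompt": str, "reference": str}


def validate_dataset_item(raw: Any) -> DatasetItem:
    """Check structure before cast() so malformed external data fails early."""
    if not isinstance(raw, dict):
        raise ValueError(f"Dataset item must be a dict, got {type(raw).__name__}")
    for key, expected in REQUIRED_ITEM_KEYS.items():
        if key not in raw:
            raise ValueError(f"Dataset item missing required key: {key!r}")
        if not isinstance(raw[key], expected):
            raise ValueError(
                f"Dataset item key {key!r} must be {expected.__name__}, "
                f"got {type(raw[key]).__name__}"
            )
    return cast(DatasetItem, raw)
```

The cast() stays, but it now runs only after the keys it promises actually exist; swapping the loop for a Pydantic model would add coercion and nested validation on top.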

…d enhance type safety by removing unused imports and organizing dependencies

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
benchwise/config.py (1)

414-419: Remove redundant bool() wrapper.

The comparison response.status_code == 200 already returns a boolean, so wrapping it with bool() is unnecessary.

Apply this diff:

         async def check_connection() -> bool:
             async with httpx.AsyncClient(timeout=5.0) as client:
                 response = await client.get(f"{config.api_url}/health")
-                return bool(response.status_code == 200)
+                return response.status_code == 200
benchwise/datasets.py (1)

947-947: Use Set[str] from typing for consistency.

The file uses typed imports like List, Dict, Optional from the typing module throughout, but here uses the lowercase set[str] syntax (Python 3.9+). For consistency with the rest of the codebase, use Set[str] from typing.

Apply this diff:

+from typing import List, Dict, Any, Optional, Union, Callable, cast, Set
...
-    important_ref_words: set[str] = set()
+    important_ref_words: Set[str] = set()
benchwise/client.py (1)

186-195: Remove redundant bool() wrapper.

The comparison response.status_code == 200 already returns a boolean, so wrapping it with bool() is unnecessary.

Apply this diff:

     async def health_check(self) -> bool:
         """Check if the Benchwise API is available."""
         try:
             response = await self.client.get("/health", timeout=5.0)
-            is_healthy = bool(response.status_code == 200)
+            is_healthy = response.status_code == 200
             logger.info(f"Health check: {'healthy' if is_healthy else 'unhealthy'}")
             return is_healthy
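The simplified health_check can be exercised without a live API by stubbing the HTTP client. A minimal sketch that mirrors the reviewed method's shape (the stub class and the except branch are assumptions for illustration, not Benchwise's actual client code):

```python
import asyncio
from typing import Any
from unittest.mock import AsyncMock


class StubResponse:
    """Bare-bones stand-in for an httpx.Response."""

    def __init__(self, status_code: int) -> None:
        self.status_code = status_code


async def health_check(client: Any) -> bool:
    """Mirrors the reviewed method: the == comparison is already a bool."""
    try:
        response = await client.get("/health", timeout=5.0)
        return response.status_code == 200
    except Exception:
        # Assumed failure branch; the real method may log or re-raise instead.
        return False


healthy_client = AsyncMock()
healthy_client.get.return_value = StubResponse(200)
assert asyncio.run(health_check(healthy_client)) is True

failing_client = AsyncMock()
failing_client.get.side_effect = ConnectionError("API down")
assert asyncio.run(health_check(failing_client)) is False
```

Because `==` on the status code already yields a bool, the annotated `-> bool` return type is satisfied without the wrapper, and mypy in strict mode accepts it as-is.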
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1f7352b and 37c697a.

📒 Files selected for processing (5)
  • benchwise/cli.py (14 hunks)
  • benchwise/client.py (35 hunks)
  • benchwise/config.py (15 hunks)
  • benchwise/datasets.py (12 hunks)
  • benchwise/metrics.py (34 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
benchwise/config.py (1)
benchwise/types.py (1)
  • ConfigDict (359-374)
benchwise/cli.py (3)
benchwise/datasets.py (3)
  • convert_metadata_to_info (718-765)
  • references (164-178)
  • size (143-145)
benchwise/client.py (1)
  • upload_results (992-1067)
benchwise/types.py (6)
  • ConfigureArgs (618-625)
  • ConfigKwargs (641-646)
  • SyncArgs (628-631)
  • StatusArgs (634-638)
  • DatasetInfo (326-338)
  • EvaluationMetadata (378-393)
benchwise/datasets.py (1)
benchwise/types.py (5)
  • DatasetItem (285-301)
  • DatasetMetadata (304-313)
  • DatasetSchema (316-323)
  • DatasetDict (349-355)
  • DatasetInfo (326-338)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test

